devtools::load_all(".")
library(patchwork)

uk_cqr <- readr::read_rds(here::here("data_results", "uk_cqr.rds"))
uk_cqr |> dplyr::count(model)
## # A tibble: 6 x 2
## model n
## <chr> <int>
## 1 epiforecasts-EpiExpert 4784
## 2 epiforecasts-EpiExpert_direct 4784
## 3 epiforecasts-EpiExpert_Rt 4784
## 4 EuroCOVIDhub-baseline 4784
## 5 EuroCOVIDhub-ensemble 4784
## 6 seabbs 4784

The UK Dataset contains predictions from six different models.
Each of these models is associated with exactly two values for every combination of target_type, horizon, quantile and target_end_date (i.e. \(2 \cdot 4 \cdot 23 \cdot 13 = 2392\) combinations): one value with the original prediction and one with the adjusted prediction after applying the CQR method.
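This structure can be checked directly. The following is a sketch, assuming the column names used throughout this document; every covariate combination should occur exactly twice, once per method:

```r
# The combination count quoted above:
# 2 target types x 4 horizons x 23 quantiles x 13 dates
stopifnot(2 * 4 * 23 * 13 == 2392)

# Sketch: count rows per covariate combination; each combination
# should appear exactly twice (original and CQR-adjusted forecast).
uk_cqr |>
  dplyr::count(model, target_type, horizon, quantile, target_end_date) |>
  dplyr::distinct(n)
```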
First, one might be interested in a particular covariate combination.
The corresponding prediction interval before and after CQR can be plotted with the plot_intervals() function:
plot_intervals(
uk_cqr,
model = "seabbs", target_type = "Cases", horizon = 3, quantile = 0.1
)

To visualize the trend along the horizon dimension or for different quantiles for each target_type separately, we can use the plot_intervals_grid() function:
plot_intervals_grid(
uk_cqr,
model = "seabbs", quantile = 0.1, facet_by = "horizon"
)

The plots reveal two main findings:
CQR seems to make the prediction intervals larger.
Since the seabbs forecasts are contributed by a single human, this finding supports the hypothesis that humans tend to be overconfident in their own forecasts, leading to too narrow prediction intervals.
A larger forecast horizon strongly correlates with higher uncertainty and, thus, wider prediction intervals.
Based on these impressions, CQR produces adjusted, often wider, interval forecasts. However, so far we cannot say whether the post-processed predictions are actually ‘better’.
scoringutils package

We evaluate the quality of forecast intervals with the Weighted Interval Score.
This metric is implemented by the scoringutils package.
It provides fine-grained control over the granularity of the evaluation.
Let us first replicate the setting from the first plot and analyze whether CQR actually improved the forecasts. Since we are primarily interested in out-of-sample performance, we filter the data frame down to the validation set:
uk_cqr |>
extract_validation_set() |>
scoringutils::score() |>
scoringutils::summarise_scores(by = c("method", "model", "target_type", "horizon", "quantile")) |>
dplyr::filter(model == "seabbs", target_type == "Cases", horizon == 3, quantile == 0.1) |>
dplyr::select(method:dispersion)
## # A tibble: 2 x 7
## method model target_type horizon quantile interval_score dispersion
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 cqr seabbs Cases 3 0.1 157. 87.5
## 2 original seabbs Cases 3 0.1 211. 38.1

Indeed, CQR leads to a lower interval score by increasing the dispersion/spread, resulting in wider intervals compared to the original forecasts.
In contrast, we can aggregate over all models, target types, horizons and quantiles and evaluate the overall performance of the CQR method:
uk_cqr |>
extract_validation_set() |>
scoringutils::score() |>
scoringutils::summarise_scores(by = c("method")) |>
dplyr::select(method:dispersion)
## # A tibble: 2 x 3
## method interval_score dispersion
## <chr> <dbl> <dbl>
## 1 cqr 62.2 24.1
## 2 original 65.7 12.0

The result shows the same trend: better and wider overall prediction intervals.
A question of particular interest is which models benefited most from the CQR adjustments.
This question can, of course, be answered in a similar way to above by including model in the by argument of summarise_scores().
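A sketch of that call, reusing the scoring pipeline from above (the column selection is only for readability):

```r
# Per-model comparison of original vs. CQR-adjusted forecasts
# on the validation set
uk_cqr |>
  extract_validation_set() |>
  scoringutils::score() |>
  scoringutils::summarise_scores(by = c("method", "model")) |>
  dplyr::arrange(model) |>
  dplyr::select(method, model, interval_score, dispersion)
```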
eval_methods() and plot_eval()

In addition to the absolute change in interval score, which can lose meaning when models operate on very different scales, the relative or percentage change allows for direct comparisons on a level playing field.
The eval_methods() function displays the relative change of CQR compared to the original predictions.
Negative values indicate a score improvement, while positive values indicate a larger and therefore worse interval score.
Note that the function automatically detects included post-processing methods from the input data frame and thus the string 'method' does not have to be provided.
Moreover, all displayed values are computed exclusively from the validation data.
eval_methods(uk_cqr, summarise_by = "model")
## # A tibble: 6 x 2
## model relative_change
## <chr> <dbl>
## 1 epiforecasts-EpiExpert -0.0568
## 2 epiforecasts-EpiExpert_direct -0.0551
## 3 epiforecasts-EpiExpert_Rt -0.061
## 4 EuroCOVIDhub-baseline 0.0088
## 5 EuroCOVIDhub-ensemble -0.0205
## 6 seabbs -0.0849

The output reveals that CQR led to improved performance for all models except EuroCOVIDhub-baseline.
For many categories the magnitudes of the relative change in the output table might still be difficult to compare, even when sorted in increasing order.
Thus, the table can be visualized either by bars (only available for a single specified category such as model in this case) or by a heatmap (the default):
df_eval <- eval_methods(uk_cqr, summarise_by = "model")
p1 <- plot_eval(df_eval, heatmap = FALSE, base_size = 8) + ggplot2::labs(y = NULL)
p2 <- plot_eval(df_eval, base_size = 8) + ggplot2::labs(y = NULL)
p1 + p2
These two functions are not restricted to a single category (or even a single post-processing method).
To illustrate the multi-category case, let’s plot the relative improvements of CQR for each combination of
model and quantile and of model and horizon, respectively:
df_mod_quant <- eval_methods(uk_cqr, summarise_by = c("model", "quantile"))
p1 <- plot_eval(df_mod_quant, base_size = 8) + ggplot2::labs(x = NULL)
df_mod_hor <- eval_methods(uk_cqr, summarise_by = c("model", "horizon"))
p2 <- plot_eval(df_mod_hor, base_size = 8) + ggplot2::labs(y = NULL)
p1 + p2

Interestingly, for the EuroCOVIDhub-baseline model, which predicts constant values, CQR leads to worse interval scores for quantiles in the middle range as well as for extreme horizons, but not vice versa!
Overall, however, the trends are similar to our conclusions before:
Extreme quantiles and large horizons benefit most.
It is worth noting that, starting from the plot output, you can always revert to the exact numerical values simply by analyzing the input data frame directly:
df_mod_hor
## # A tibble: 6 x 5
## model `1` `2` `3` `4`
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 epiforecasts-EpiExpert 0.0039 0.0109 -0.0634 -0.0883
## 2 epiforecasts-EpiExpert_direct -0.0038 -0.0026 -0.0584 -0.0795
## 3 epiforecasts-EpiExpert_Rt 0.0153 0.0153 -0.0596 -0.109
## 4 EuroCOVIDhub-baseline 0.024 -0.0731 0.0019 0.0647
## 5 EuroCOVIDhub-ensemble 0.0002 0.0049 -0.0289 -0.0301
## 6 seabbs -0.0294 -0.0093 -0.102 -0.109

eval_methods()

In case all of the values in a particular row or column of the eval_methods() output have the same sign, it might be useful to group the values into, e.g., below- and above-average improvements.
One way to average ratios (here the difference in scores divided by the original score) is to take the geometric mean.
This can be achieved for rows and/or columns with the row_averages and col_averages arguments to eval_methods():
df_mod_tar <- eval_methods(
uk_cqr,
summarise_by = c("model", "target_type"),
row_averages = TRUE, col_averages = TRUE
)
df_mod_tar
## # A tibble: 7 x 4
## model Cases Deaths average_change
## <chr> <dbl> <dbl> <dbl>
## 1 epiforecasts-EpiExpert -0.0568 -0.0106 -0.034
## 2 epiforecasts-EpiExpert_direct -0.0551 -0.0131 -0.0343
## 3 epiforecasts-EpiExpert_Rt -0.0609 -0.0976 -0.0795
## 4 EuroCOVIDhub-baseline 0.0094 -0.129 -0.0622
## 5 EuroCOVIDhub-ensemble -0.0205 -0.014 -0.0173
## 6 seabbs -0.0849 -0.0893 -0.0871
## 7 <NA> -0.0453 -0.0601 -0.0527

The last row contains the geometric mean of each column and the last column the geometric mean of each row.
Since target_type only has two possible values, the geometric mean in the last column always lies strictly between the two values of the corresponding row.
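To make the averaging concrete, the row average for epiforecasts-EpiExpert from the table above can be reproduced in plain R. This sketch assumes the averages are geometric means of the score ratios \(1 + \text{relative change}\); the numbers match up to rounding:

```r
# Relative changes for epiforecasts-EpiExpert (Cases, Deaths)
rel <- c(-0.0568, -0.0106)

# Geometric mean of the ratios (1 + rel), converted back to a change
geo <- exp(mean(log(1 + rel))) - 1
round(geo, 3)  # -0.034, the average_change shown above
```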
This table can be visualized in the same way as before:
plot_eval(df_mod_tar)

The plot shows that, depending on the model, the CQR improvements are sometimes stronger for Cases and sometimes for Deaths.
Finally, one might analyze both the performance for a combination of two categories as well as the marginal performance for a single category. This can be done separately:
eval_methods(uk_cqr, summarise_by = "model")
eval_methods(uk_cqr, summarise_by = "horizon")
eval_methods(uk_cqr, summarise_by = c("model", "horizon"))

However, jumping back and forth between three different tables can be annoying.
For that reason the marginal changes can be added to the two-dimensional table with the margins argument:
eval_mod_hor <- eval_methods(
uk_cqr,
summarise_by = c("model", "horizon"), margins = TRUE
)

Due to the large number of entries, a visualization is again convenient:
plot_eval(eval_mod_hor)

While the marginal performance of all models except EuroCOVIDhub-baseline is improved by applying CQR, the interactions between those models and low horizons do not benefit from post-processing.
hub_1 <- readr::read_rds(here::here("data_results", "hub_cqr_1.rds"))
hub_2 <- readr::read_rds(here::here("data_results", "hub_cqr_2.rds"))
hub_cqr <- dplyr::bind_rows(hub_1, hub_2)

For analyzing the impact of CQR on the European Hub Data, we selected a subset of six different models. In contrast to the UK Dataset, we can now analyze performance differences between \(18\) European countries.
hub_cqr |> dplyr::count(model)
## # A tibble: 6 x 2
## model n
## <chr> <int>
## 1 epiforecasts-EpiNow2 192096
## 2 EuroCOVIDhub-baseline 192096
## 3 EuroCOVIDhub-ensemble 192096
## 4 IEM_Health-CovidProject 192096
## 5 ILM-EKF 192096
## 6 USC-SIkJalpha 192096
hub_cqr |> dplyr::count(location_name)
## # A tibble: 18 x 2
## location_name n
## <chr> <int>
## 1 Austria 64032
## 2 Bulgaria 64032
## 3 Croatia 64032
## 4 Denmark 64032
## 5 Finland 64032
## 6 Germany 64032
## 7 Greece 64032
## 8 Italy 64032
## 9 Latvia 64032
## 10 Luxembourg 64032
## 11 Malta 64032
## 12 Norway 64032
## 13 Poland 64032
## 14 Portugal 64032
## 15 Romania 64032
## 16 Slovakia 64032
## 17 Slovenia 64032
## 18 United Kingdom 64032

Let us first get a quick impression of the predictions for Germany:
plot_intervals_grid(hub_cqr, model = "EuroCOVIDhub-ensemble", location = "DE", quantiles = 0.1)

Here the CQR adjustments are more interesting than for the UK Data:
CQR appears to extend the prediction intervals for Cases, whereas it reduces the spread for Deaths.
We can check whether this statement holds when averaged across all models and quantiles:
hub_cqr |>
dplyr::filter(location_name == "Germany") |>
scoringutils::score() |>
scoringutils::summarise_scores(by = c("method", "target_type")) |>
dplyr::arrange(target_type) |>
dplyr::select(method:dispersion)
## # A tibble: 4 x 4
## method target_type interval_score dispersion
## <chr> <chr> <dbl> <dbl>
## 1 cqr Cases 21.1 10.2
## 2 original Cases 22.3 7.87
## 3 cqr Deaths 0.177 0.0923
## 4 original Deaths 0.181 0.0896While CQR improves the interval score for both target_types, the dispersion also increases in both cases such that the previous finding does not generalize.
Next, we consider the effects of all models for each country:
df_mod_loc <- eval_methods(hub_cqr, summarise_by = c("model", "location_name"))
plot_eval(df_mod_loc)

This plot is rather uninformative since Poland appears to be a huge outlier for all models, particularly for the IEM_Health-CovidProject predictions.
Here, CQR makes the forecast intervals around \(400\)% worse, which is quite hard to believe.
Let us briefly look into the scores for Poland in more detail by computing the raw scores separately on the training and the validation set:
cqr_poland <- hub_cqr |>
dplyr::filter(location_name == "Poland")
cqr_poland |>
extract_training_set() |>
scoringutils::score() |>
scoringutils::summarise_scores(by = c("method", "model")) |>
dplyr::arrange(model) |>
dplyr::select(method:dispersion)
## # A tibble: 12 x 4
## method model interval_score dispersion
## <chr> <chr> <dbl> <dbl>
## 1 cqr epiforecasts-EpiNow2 22.8 7.24
## 2 original epiforecasts-EpiNow2 23.7 6.24
## 3 cqr EuroCOVIDhub-baseline 31.7 9.88
## 4 original EuroCOVIDhub-baseline 35.8 4.01
## 5 cqr EuroCOVIDhub-ensemble 29.4 9.75
## 6 original EuroCOVIDhub-ensemble 31.4 7.54
## 7 cqr IEM_Health-CovidProject 56.5 17.9
## 8 original IEM_Health-CovidProject 62.0 12.0
## 9 cqr ILM-EKF 37.8 12.5
## 10 original ILM-EKF 38.5 11.4
## 11 cqr USC-SIkJalpha 27.4 10.9
## 12 original USC-SIkJalpha 30.6 8.06
cqr_poland |>
extract_validation_set() |>
scoringutils::score() |>
scoringutils::summarise_scores(by = c("method", "model")) |>
dplyr::arrange(model) |>
dplyr::select(method:dispersion)
## # A tibble: 12 x 4
## method model interval_score dispersion
## <chr> <chr> <dbl> <dbl>
## 1 cqr epiforecasts-EpiNow2 1.94 1.21
## 2 original epiforecasts-EpiNow2 1.32 0.418
## 3 cqr EuroCOVIDhub-baseline 6.46 5.73
## 4 original EuroCOVIDhub-baseline 3.34 2.69
## 5 cqr EuroCOVIDhub-ensemble 2.34 1.87
## 6 original EuroCOVIDhub-ensemble 0.964 0.355
## 7 cqr IEM_Health-CovidProject 4.68 4.22
## 8 original IEM_Health-CovidProject 0.908 0.222
## 9 cqr ILM-EKF 1.71 1.35
## 10 original ILM-EKF 0.862 0.501
## 11 cqr USC-SIkJalpha 2.77 1.89
## 12 original USC-SIkJalpha 1.65 0.0410

The discrepancy is striking! This indicates a sudden change in the COVID situation in Poland right around the point where we split the data. We compare the development of COVID Cases between Poland and Germany:
hub_cqr |>
dplyr::filter(location %in% c("PL", "DE"), method == "original", model == "IEM_Health-CovidProject", target_type == "Cases", quantile == 0.05, horizon == 1) |>
dplyr::select(location_name, forecast_date, true_value) |>
tidyr::pivot_wider(names_from = location_name, values_from = true_value) |>
print(n = Inf)
## # A tibble: 29 x 3
## forecast_date Germany Poland
## <date> <dbl> <dbl>
## 1 2021-03-15 120. 388.
## 2 2021-03-22 137. 487.
## 3 2021-03-29 125. 510.
## 4 2021-04-05 149. 362.
## 5 2021-04-12 175. 324.
## 6 2021-04-19 166. 199.
## 7 2021-04-26 160. 124.
## 8 2021-05-03 129. 80.5
## 9 2021-05-10 78.6 59.8
## 10 2021-05-17 69.1 33.3
## 11 2021-05-24 38.3 18.0
## 12 2021-05-31 26.9 9.09
## 13 2021-06-07 18.6 6.37
## 14 2021-06-14 8.83 3.66
## 15 2021-06-21 5.51 2.46
## 16 2021-06-28 4.72 1.70
## 17 2021-07-05 6.17 1.42
## 18 2021-07-12 9.78 1.58
## 19 2021-07-19 12.8 1.87
## 20 2021-07-26 18.0 2.30
## 21 2021-08-02 22.8 2.73
## 22 2021-08-09 35.0 3.18
## 23 2021-08-16 55.1 3.50
## 24 2021-08-23 76.8 3.99
## 25 2021-08-30 98.3 5.62
## 26 2021-09-06 79.0 7.93
## 27 2021-09-13 77.4 11.1
## 28 2021-09-20 66.3 13.7
## 29 2021-09-27 68.3 18.9
p1 <- cqr_poland |>
plot_intervals(model = "IEM_Health-CovidProject", target_type = "Cases")
p2 <- hub_cqr |>
dplyr::filter(location == "DE") |>
plot_intervals(model = "IEM_Health-CovidProject", target_type = "Cases")
p1 + p2

In both countries the incidence dropped drastically during the summer months of 2021. While the number of Cases increased again in autumn in Germany, the incidence in Poland stayed very low (at least in the dataset). Thus, the distribution seen while training CQR looks very different from the distribution of Cases during inference.
Since the plot shows 1-week-ahead forecasts, the original predictions could adapt quite quickly based on human knowledge about the situation at the time. CQR, in contrast, adjusts only slowly to the distribution shift.
We exclude Poland from the further analysis to obtain more meaningful visual illustrations for the remaining countries. Let’s display the same picture as before without Poland:
cqr_no_poland <- hub_cqr |>
dplyr::filter(location_name != "Poland")
df_mod_loc <- eval_methods(cqr_no_poland, summarise_by = c("model", "location_name"))
plot_eval(df_mod_loc)

For all other countries, CQR achieves its largest improvements for the USC-SIkJalpha model, which contains predictions from a research group at the University of Southern California.
For the UK Data, we could improve forecasts most for large forecast horizons, quantiles in the tails of the predictive distribution, and Cases instead of Deaths. We now investigate whether these effects persist for the European Forecast Hub, stratified by country:
df_hor_loc <- eval_methods(cqr_no_poland, summarise_by = c("horizon", "location_name"))
plot_eval(df_hor_loc)
df_quant_loc <- eval_methods(cqr_no_poland, summarise_by = c("quantile", "location_name"))
plot_eval(df_quant_loc)
df_target_loc <- eval_methods(cqr_no_poland, summarise_by = c("target_type", "location_name"))
plot_eval(df_target_loc)

Indeed, the same trends are identifiable, although to a much more moderate degree.
After removing Poland from the data, Croatia now seems to be an outlier where CQR does not improve the interval score across many scenarios.
hub_cqr |>
dplyr::filter(location_name == "Croatia") |>
extract_validation_set() |>
scoringutils::score() |>
scoringutils::summarise_scores(by = c("method", "model", "target_type", "quantile", "horizon")) |>
dplyr::filter(target_type == "Cases", horizon == 2, quantile == 0.2) |>
dplyr::arrange(model) |>
dplyr::select(method:dispersion)
## # A tibble: 12 x 7
## method model target_type quantile horizon interval_score dispersion
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 cqr epiforecasts-… Cases 0.2 2 19.2 12.5
## 2 origin… epiforecasts-… Cases 0.2 2 19.3 11.6
## 3 cqr EuroCOVIDhub-… Cases 0.2 2 29.3 27.5
## 4 origin… EuroCOVIDhub-… Cases 0.2 2 20.3 12.0
## 5 cqr EuroCOVIDhub-… Cases 0.2 2 12.3 10.3
## 6 origin… EuroCOVIDhub-… Cases 0.2 2 11.5 8.97
## 7 cqr IEM_Health-Co… Cases 0.2 2 15.0 11.8
## 8 origin… IEM_Health-Co… Cases 0.2 2 13.8 8.35
## 9 cqr ILM-EKF Cases 0.2 2 21.4 18.1
## 10 origin… ILM-EKF Cases 0.2 2 20.8 14.1
## 11 cqr USC-SIkJalpha Cases 0.2 2 20.7 12.9
## 12 origin… USC-SIkJalpha Cases 0.2 2 19.6 4.15

In this case, however, the performance drop is primarily caused by the EuroCOVIDhub-baseline model, for which CQR increases the interval width by a large amount.
For this specific covariate combination, the adjusted intervals contain the true value in only a few more situations than the original predictions, at the cost of much lower precision:
hub_cqr |>
dplyr::filter(location_name == "Croatia") |>
plot_intervals(model = "EuroCOVIDhub-baseline", target_type = "Cases", horizon = 2, quantile = 0.2)